Preface



These proceedings contain the papers accepted for presentation at the First European 
Semantic Web Symposium (ESWS 2004) held on Crete, Greece, May 10-12, 2004. 

Given its status as an inaugural event, the organizers were delighted to receive 79 
high-quality submissions. Most papers were reviewed by at least three referees, with 
the review results coordinated by the academic and industrial track chairs. In total, 27 
papers were accepted for the academic track and 6 papers were accepted for the 
industrial track. The papers span a wide range of topics from the Semantic Web area, 
from infrastructure and ontology engineering to applications. 

The high quality of this symposium is due to the efforts of many people. Jos de Bruijn 
in particular worked hard in a number of areas, including submissions management, 
publicity and the poster program. We would also like to thank Martin Doerr for local
arrangements, Johannes Breitfuss for the WWW site, the Program Committee and 
additional reviewers for their invaluable support and the sponsors for their financial 
support. 
Towards On-the-Fly Ontology Construction -
Focusing on Ontology Quality Improvement 



Naoki Sugiura 1, Yoshihiro Shigeta 1, Naoki Fukuta 1, Noriaki Izumi 2, and
Takahira Yamaguchi 1

1 Shizuoka University, 3-5-1 Johoku, Hamamatsu, Shizuoka 432-8011, Japan,
sugiura@ks.cs.inf.shizuoka.ac.jp,
http://mmm.semanticweb.org

2 National Institute of AIST, 2-41-6, Aomi, Koto-ku, Tokyo, Japan 



Abstract. To realize on-the-fly ontology construction for the Semantic
Web, this paper proposes DODDLE-R, a support environment for
user-centered ontology development. It consists of two main parts: a
pre-processing part and a quality improvement part. The pre-processing
part generates a prototype ontology semi-automatically, and the quality
improvement part supports its interactive refinement. Because we believe
that constructing ontologies carefully from the preliminary phase is more
efficient than attempting to generate them fully automatically (which may
require too many modifications by hand), the quality improvement part
plays a significant role in DODDLE-R. Through interactive support for
improving the quality of the prototype ontology, an OWL-Lite level
ontology, which consists of taxonomic relationships (class-subclass
relationships) and non-taxonomic relationships (defined as properties),
is constructed efficiently.



1 Introduction 

As the scale of the Web grows, it is becoming more difficult to find
appropriate information on it. When a user uses a search engine, many
Web pages or Web services match the user’s input words syntactically but
are semantically incorrect and do not suit the user’s intention. To
overcome this situation, the Semantic Web [1] is now attracting attention
from researchers in a wide range of areas. By adding semantics (meta-data)
to Web contents, software agents become able to understand Web resources
and even make inferences about them. To realize such a paradigm,
ontologies [2][3] play an important role in sharing a common understanding
among both people and software agents [4]. On the one hand, in the
knowledge engineering field, ontologies have been developed for particular
knowledge systems, mainly to reuse domain knowledge. On the other hand, for
the Semantic Web, ontologies are constructed in distributed places or
domains and then mapped to each other. For this purpose, it is an urgent
task to realize a software environment for rapid construction of
ontologies for each domain.
Towards on-the-fly ontology construction, much research focuses on
automatic ontology construction from existing Web resources, such as
dictionaries, by machine processing with concept extraction algorithms.
However, even if the machine produces ontologies automatically, users
still need to check the output ontology. Checking the whole ontology for
correctness and modifying it may be a great burden for users, especially
if the automatically produced ontology is large. Considering this
situation, we believe that the most important aspect of on-the-fly
ontology construction is how efficiently users, such as domain experts,
are able to check the output ontology in order to make Semantic Web
contents available to the public. For this reason, ontologies should be
constructed not fully automatically, but through interactive support by a
software environment from the early stages of ontology construction.
Although this may seem contradictory in terms of efficiency, the total
cost of ontology construction would be lower than that of automatic
construction, because fewer construction errors occur when the ontology
is built through careful interaction between the system and the user. It
also means that a high-quality ontology would be constructed. In this
paper, we propose a software environment for user-centered on-the-fly
ontology construction named DODDLE-R (Domain Ontology rapiD DeveLopment
Environment - RDF [5] extension). The architecture of DODDLE-R is a
re-design based on DODDLE-II [6], its predecessor. Although DODDLE-II
already provided interactive support for ontology construction, its
system architecture was not well considered or sophisticated. The
DODDLE-R system is modularized into a machine-processing module and a
user-interaction module in order to separate the pre-processing part from
the user-centered quality management part. In particular, to realize the
user-centered environment, DODDLE-R concentrates on the quality
improvement part, which enables us to develop ontologies with interactive
indication of which parts of the ontology should be modified. The system
supports the construction of both taxonomic and non-taxonomic
relationships in ontologies. Additionally, because DODDLE-II was built
for ontology construction not for the Semantic Web but for typical
knowledge systems, it needs some extensions for the Semantic Web, such as
an OWL (Web Ontology Language) [7] import and export facility. DODDLE-R
supports OWL-Lite level ontology construction because, for user-centered
ontology construction, OWL-DL or OWL-Full seems too complicated for
humans to understand thoroughly. DODDLE-R thus contributes to the
evolution of ontology construction and the Semantic Web.

2 System Design of DODDLE-R 

Fig. 1 shows an overview of DODDLE-R. Its main feature is its
modularization into two parts: the pre-processing part and the quality
improvement part. In the pre-processing part, the system generates the
basis of the ontology, namely a taxonomy and extracted concept pairs, by
referring to WordNet [8] as an MRD (Machine Readable Dictionary) and to a
domain specific text corpus. A taxonomy is a hierarchy of IS-A
relationships. Concept pairs are extracted based on co-occurrence by
using statistical methods. These pairs are candidates that may have
significant relationships, and a user identifies the relationship between
the concepts in each pair. In the quality improvement part, the prototype
ontology produced by the pre-processing part is modified by a user
through interactive support by the system.

Fig. 1. DODDLE-R overview

2.1 Pre-processing Part 

In the pre-processing part, the system generates the basis of the output
ontology for further modification by a user. Fig. 2 describes the
procedure of the pre-processing part. This part consists of three
sub-parts: input concept selection, taxonomy building, and related
concept pair acquisition. First, as input to the system, several domain
specific terms are selected by a user. The system shows a list of noun
concepts in the domain specific text corpus as candidates for input
concepts. At this phase, the user also identifies the senses of the terms
in order to map them to concepts in WordNet.

To build the taxonomic relationships (class-subclass relationships) of an
ontology, the system attempts to extract “best-matched concepts”. That
is, “concept matching” between input concepts and WordNet concepts is
performed, and the matched nodes are extracted and then merged at each
root node. To extract related concept pairs from the domain specific text
corpus as a basis for identifying non-taxonomic relationships (such as
the “part-of” relationship), statistical methods are applied. In
particular, WordSpace [9] and an association rule algorithm [10] are used
in this part; these methods attempt to identify significantly related
concept pairs.
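
The taxonomy-building step described above can be sketched as follows. This is only a minimal illustration: it assumes a toy child-to-parent table (the hypothetical `HYPERNYM`) in place of WordNet, assumes the user has already chosen one sense per term, and uses function names of our own choosing that are not part of DODDLE-R.

```python
# Toy stand-in for WordNet hypernym links (child -> parent).
# All entries are illustrative assumptions, not real WordNet data.
HYPERNYM = {
    "seller": "merchant",
    "merchant": "person",
    "buyer": "person",
    "person": "entity",
    "contract": "document",
    "document": "entity",
}


def path_to_root(concept, hypernym):
    """Return the chain [concept, parent, ..., root] for one matched concept."""
    path = [concept]
    while path[-1] in hypernym:
        path.append(hypernym[path[-1]])
    return path


def build_taxonomy(input_concepts, hypernym):
    """Extract the root path of every input concept and merge the paths
    into a single child -> parent map (shared ancestors are merged)."""
    taxonomy = {}
    for concept in input_concepts:
        path = path_to_root(concept, hypernym)
        for child, parent in zip(path, path[1:]):
            taxonomy[child] = parent
    return taxonomy


taxonomy = build_taxonomy(["seller", "buyer", "contract"], HYPERNYM)
# Root nodes are parents that never appear as a child.
roots = set(taxonomy.values()) - set(taxonomy)
```

Because the paths of "seller" and "buyer" share the ancestor "person", they end up in one merged tree; in a real dictionary a concept may have several hypernyms, which this single-parent sketch deliberately ignores.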



Construction of WordSpace. WordSpace is constructed as shown in Fig. 3.

1. Extraction of high-frequency 4-grams. Since letter-by-letter
co-occurrence information is too voluminous and often irrelevant, we take
term-by-term co-occurrence information in four words (4-grams) as the
primitive for building a co-occurrence matrix that usefully represents
the context of a text; this choice is based on experimental results. We
take high-frequency 4-grams in order to build WordSpace.

Fig. 2. Pre-processing Part

2. Construction of collocation matrix. A collocation matrix is constructed
in order to compare the contexts of two 4-grams. Element $a_{i,j}$ of this
matrix is the number of times 4-gram $f_i$ comes up just before 4-gram
$f_j$ (in the so-called collocation area). The collocation matrix thus
counts how many other 4-grams come up before the target 4-gram. Each
column of this matrix is the 4-gram vector of a 4-gram $f$.

3. Construction of context vectors. A context vector represents the
context of a word or phrase in a text. The context vector of a word or
phrase at a given place of appearance is the sum of the 4-gram vectors
around that place (called the context area).

4. Construction of word vectors. A word vector is the sum of the context
vectors at all places where a word or phrase appears within the texts,
and can be expressed by Eq. 1. Here, $\tau(w)$ is the vector
representation of a word or phrase $w$, $C(w)$ is the set of places where
$w$ appears in a text, and $\varphi(f)$ is the 4-gram vector of a 4-gram
$f$. The set of vectors $\tau(w)$ is WordSpace.

\[ \tau(w) = \sum_{i \in C(w)} \Bigl( \sum_{f \text{ close to } i} \varphi(f) \Bigr) \qquad (1) \]

5. Construction of vector representations of all concepts. The
best-matched “synset” of each input term in WordNet has already been
specified, and the sum of the word vectors of the words contained in that
synset is set as the vector representation of the concept corresponding
to the input term. The concept label is the input term.

6. Construction of a set of similar concept pairs. Vector representations
of all concepts are obtained by constructing WordSpace. The similarity
between concepts is obtained from the inner products of all combinations
of these vectors. We then define a certain threshold for this similarity.
A concept pair with similarity beyond the threshold is extracted as a
similar concept pair.
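
The six steps above can be sketched in a few lines of Python. This is a simplified illustration under several assumptions of ours: the corpus is already tokenized, the collocation area is the `colloc` 4-grams starting immediately before the target, the context area is given by `front` and `behind` token offsets, and each input term is treated as its own concept (step 5, summing over a synset, is omitted). All function and parameter names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def ngrams(tokens, n=4):
    """Term-by-term n-grams (step 1)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_word_vectors(tokens, words, n=4, min_count=1,
                       colloc=1, front=2, behind=2):
    grams = ngrams(tokens, n)
    freq = Counter(grams)
    frequent = {g for g, c in freq.items() if c >= min_count}

    # Step 2: gram_vec[f][g] counts how often g starts within `colloc`
    # positions before f; gram_vec[f] is the 4-gram vector of f.
    gram_vec = defaultdict(Counter)
    for pos, f in enumerate(grams):
        if f not in frequent:
            continue
        for prev in grams[max(0, pos - colloc):pos]:
            if prev in frequent:
                gram_vec[f][prev] += 1

    # Steps 3 and 4: sum the 4-gram vectors around every occurrence
    # of a word (context vector), then over all occurrences (word vector).
    word_vec = {w: Counter() for w in words}
    for i, tok in enumerate(tokens):
        if tok not in word_vec:
            continue
        lo = max(0, i - front)
        hi = min(len(grams), i + behind + 1)
        for pos in range(lo, hi):
            if grams[pos] in frequent:
                word_vec[tok].update(gram_vec[grams[pos]])
    return word_vec


def inner(u, v):
    """Inner product of two sparse vectors stored as Counters."""
    return sum(count * v[key] for key, count in u.items())


def similar_pairs(word_vec, threshold):
    """Step 6: keep pairs whose inner product exceeds the threshold."""
    pairs = []
    for a, b in combinations(sorted(word_vec), 2):
        sim = inner(word_vec[a], word_vec[b])
        if sim > threshold:
            pairs.append((a, b, sim))
    return pairs


text = ("the seller must deliver the goods the buyer must pay the price "
        "the seller must deliver the documents the buyer must pay the interest")
vectors = build_word_vectors(text.split(), ["seller", "buyer", "goods"])
pairs = similar_pairs(vectors, threshold=0)
```

On a corpus this small the vectors are very sparse, so the example is meant to show the data flow rather than meaningful similarities; the scopes and the minimum count correspond to the parameters the user can adjust in the interface.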



Finding Association Rules between Input Terms. The basic association
rule algorithm is provided with a set of transactions
$T := \{t_i \mid i = 1..n\}$, where each transaction $t_i$ consists of a
set of items, $t_i = \{a_{i,j} \mid j = 1..m_i,\ a_{i,j} \in C\}$, and each
item $a_{i,j}$ is from a set of concepts $C$. The algorithm finds
association rules $X_k \Rightarrow Y_k$
$(X_k, Y_k \subset C,\ X_k \cap Y_k = \{\})$ such that the measures for
support and confidence exceed user-defined thresholds. Here, the support
of a rule $X_k \Rightarrow Y_k$ is the percentage of transactions that
contain $X_k \cup Y_k$ as a subset (Eq. 2), and the confidence of the rule
is the percentage of transactions in which $Y_k$ is seen when $X_k$
appears in a transaction (Eq. 3).

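\[ \mathrm{support}(X_k \Rightarrow Y_k) = \frac{|\{t_i \mid X_k \cup Y_k \subseteq t_i\}|}{n} \qquad (2) \]

\[ \mathrm{confidence}(X_k \Rightarrow Y_k) = \frac{|\{t_i \mid X_k \cup Y_k \subseteq t_i\}|}{|\{t_i \mid X_k \subseteq t_i\}|} \qquad (3) \]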

As we regard input terms as items and sentences in the text corpus as
transactions, DODDLE-R finds associations between terms in the text
corpus. Based on experimental results, we set the support threshold to
0.4% and the confidence threshold to 80%. When an association rule
between terms exceeds both thresholds, the pair of terms is extracted as
a candidate for a non-taxonomic relationship.
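
As a rough illustration of Eq. 2 and Eq. 3 with sentences as transactions, the sketch below computes support and confidence for rules between single terms. Restricting both sides of a rule to one term, the function name, and the use of "greater than or equal" for the thresholds are assumptions made here for brevity; the default values mirror the 0.4% and 80% quoted in the text.

```python
from itertools import permutations


def association_pairs(transactions, items, min_support=0.004,
                      min_confidence=0.8):
    """Return rules (x, y, support, confidence) for single terms x => y.

    transactions: list of term lists, one per sentence.
    items: the input terms that are regarded as items.
    """
    n = len(transactions)
    item_set = set(items)
    sets = [set(t) & item_set for t in transactions]
    rules = []
    for x, y in permutations(items, 2):
        n_x = sum(1 for s in sets if x in s)
        if n_x == 0:
            continue  # x never occurs, confidence is undefined
        n_xy = sum(1 for s in sets if x in s and y in s)
        support = n_xy / n          # Eq. 2
        confidence = n_xy / n_x     # Eq. 3
        if support >= min_support and confidence >= min_confidence:
            rules.append((x, y, support, confidence))
    return rules


# Hypothetical sentences, already reduced to their input terms.
sentences = [
    ["seller", "goods"],
    ["seller", "goods", "buyer"],
    ["buyer", "price"],
    ["goods"],
]
rules = association_pairs(sentences, ["seller", "buyer", "goods", "price"])
```

In this toy corpus every sentence mentioning "seller" also mentions "goods", so that rule has confidence 1.0, whereas the reverse rule falls below the confidence threshold because "goods" also appears without "seller".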
Fig. 5. Matched Result Analysis
Fig. 6. Trimmed Result Analysis



2.2 Quality Improvement Part 

In order to improve the quality of the pre-processed ontology, the
quality improvement part works interactively with a user. Fig. 4 shows
the procedure of this part. Because the pre-processed taxonomy is
constructed from a general ontology, we need to adjust the taxonomy to
the specific domain, taking into account an issue called concept drift:
the position of particular concepts changes depending on the domain. For
concept drift management, DODDLE-R applies two strategies: Matched Result
Analysis (Fig. 5) and Trimmed Result Analysis (Fig. 6).

In Matched Result Analysis, the system divides the taxonomy into PABs
(PAths including only Best-matched concepts) and STMs (SubTrees that
include best-matched concepts and other concepts and so can be Moved) and
indicates them on the screen. PABs are paths that include only
best-matched concepts that have senses suitable for the given domain.
STMs are subtrees whose root is an internal concept of WordNet and whose
subordinates are all best-matched concepts. Because the sense of an
internal concept has not yet been identified by a user, STMs may be moved
to other places to adjust the concepts to the domain. In addition, for
Trimmed Result Analysis, the system counts the number of internal
concepts that were removed when a part of the taxonomy was trimmed.
Treating this number as the original distance between the two concepts
that became adjacent, the system suggests moving the lower concept to
another place.
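
The counting idea behind Trimmed Result Analysis can be illustrated as follows. The sketch makes simplifying assumptions that are ours, not the paper's: every internal concept that is not best-matched is trimmed (except the root), the taxonomy is a single-parent map, and a hypothetical `threshold` decides when a concept is flagged. The toy hierarchy is likewise invented for the example.

```python
def trimmed_distances(hypernym, matched):
    """For each best-matched concept, find the nearest kept ancestor and
    count how many internal concepts were trimmed in between."""
    result = {}
    for concept in matched:
        trimmed = 0
        node = hypernym.get(concept)
        # Walk upwards past internal concepts that are neither
        # best-matched nor the root (the root has no parent entry).
        while node is not None and node not in matched and node in hypernym:
            trimmed += 1
            node = hypernym[node]
        result[concept] = (node, trimmed)
    return result


def candidates_to_move(distances, threshold=2):
    """Concepts whose original distance to the kept parent is large."""
    return sorted(c for c, (_, n) in distances.items() if n >= threshold)


# Hypothetical general-ontology fragment (child -> parent).
hypernym = {
    "seller": "merchant",
    "merchant": "businessperson",
    "businessperson": "person",
    "buyer": "person",
    "person": "entity",
}
matched = {"seller", "buyer", "person"}

distances = trimmed_distances(hypernym, matched)
flagged = candidates_to_move(distances)
```

Here "seller" ends up directly under "person" after trimming although two internal concepts originally separated them, so it is the one concept a user would be prompted to reconsider.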

As a facility for related concept pair discovery, there are functions
that allow users to improve the quality of the extracted concept pairs
through trial and error by changing the parameters of the statistical
methods. Users can re-adjust the parameters of WordSpace and the
association rule algorithm and check the results. After that, the system
generates “Concept Specification Templates” from the results. These
consist of concept pairs that are considered to have a significant
relationship according to the values produced by the statistical methods.

By referring to the constructed domain specific taxonomic relationships
and the “Concept Specification Templates”, a user develops a domain
ontology.
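
To make the target of this step concrete, the sketch below serializes a small taxonomy and one user-named relationship as OWL classes, subclass axioms and an object property in Turtle syntax. It is an assumption-laden illustration only: DODDLE-R itself relies on MR³ and Jena for OWL handling, whereas this example uses plain string formatting, a hypothetical namespace `http://example.org/cisg#`, and an invented property name `sellsTo`. Concept names are assumed to be valid Turtle local names (no spaces).

```python
def to_turtle(taxonomy, properties, base="http://example.org/cisg#"):
    """taxonomy: child -> parent map (taxonomic relationships).
    properties: (name, domain, range) triples named by the user for
    related concept pairs (non-taxonomic relationships)."""
    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        f"@prefix : <{base}> .",
        "",
    ]
    classes = sorted(set(taxonomy) | set(taxonomy.values()))
    for cls in classes:
        lines.append(f":{cls} a owl:Class .")
    for child, parent in sorted(taxonomy.items()):
        lines.append(f":{child} rdfs:subClassOf :{parent} .")
    for name, domain, range_ in properties:
        lines.append(
            f":{name} a owl:ObjectProperty ;\n"
            f"    rdfs:domain :{domain} ;\n"
            f"    rdfs:range :{range_} ."
        )
    return "\n".join(lines)


ttl = to_turtle(
    {"seller": "person", "buyer": "person"},
    [("sellsTo", "seller", "buyer")],  # hypothetical relationship label
)
print(ttl)
```

Named classes, `rdfs:subClassOf` between them, and object properties with a domain and range all fall within the constructs the abstract attributes to an OWL-Lite level ontology.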

3 Implementation 

In this section, we describe the system architecture from the aspect of
system implementation. The DODDLE-R support environment for ontology
construction is realized in conjunction with MR³ (Meta-Model Management
based on RDF(S) [11] Revision Reflection) [12]. MR³ is a graphical RDF(S)
editor with a meta-model management facility, such as consistency
checking between classes and a model in which these classes are used as
the types of instances. Fig. 7 shows the relationship between DODDLE-R
and MR³ in terms of system implementation. Both MR³ and DODDLE-R are
implemented in the Java language (they work on Java 2 or higher). MR³ is
implemented using JGraph [13] for RDF(S) graph visualization and the
Jena 2 Semantic Web Framework [14] for enabling the use of Semantic Web
standards such as RDF, RDFS, N-Triples and OWL. By using these libraries,
MR³ is implemented as an environment for graphical representation of
Semantic Web contents. Additionally, because MR³ has a plug-in facility
to extend its functionality, it can provide other functions such as
connectivity to a Sesame RDF(S) server [15].

Fig. 8. Quality improvement process with DODDLE-R graphical user interface

On top of the MR³ base environment, DODDLE-R is implemented as a support
environment for ontology construction. Fig. 8 depicts the procedure of
quality improvement with the graphical user interface of the system.
DODDLE-R’s graphical user interface consists of an ontology information
viewer, a corpus viewer and a non-taxonomic relationship acquisition
window, as in Fig. 9. The ontology information viewer shows information
about particular concepts, such as the dictionary definition of the
concept and its distance from the default root node of the ontology. In
addition, generated hierarchies are visualized by the MR³ graph editor.
On the editor, the system indicates the parts of the ontology that may
need to be modified to make it suitable for the domain, according to
matched result analysis and trimmed result analysis. The corpus viewer
shows the domain specific text corpus that was referred to when acquiring
related concept pairs by WordSpace and the association rule algorithm.
When the user clicks a concept in the concept hierarchy, the corpus
viewer highlights related terms in the corpus so that the user can see
how the term or concept is used in the actual text. The non-taxonomic
relationship acquisition window is used for setting the parameters with
which WordSpace and the association rule algorithm are applied to the
domain specific text corpus in order to generate significantly related
concept pairs. For WordSpace, there are parameters such as the gram
number (the default is four), the minimum N-gram count (to extract
high-frequency grams only), and the front scope and behind scope in the
text. For the association rule algorithm, the minimum confidence and
minimum support can be set by the user.
Fig. 9. A graphical user interface for non-taxonomic relationship management



4 An Example of Ontology Construction with 
DODDLE-R 

In this section, we show a brief example of ontology construction with
DODDLE-R. As the domain specific text corpus for this ontology
construction, we selected the text of the CISG (Contracts for the
International Sale of Goods) [16], from the field of law, in order to
compare with the case study that was carried out using DODDLE-II. This
corpus is composed of approximately 10,000 words.

4.1 Input Concept Selection 

Before starting the pre-processing part, a user needs to select some
terms as the input. As input to DODDLE-R, the user needs to associate
those terms with concepts in WordNet. For example, the user decides which
“concept” (or synset) in WordNet is suitable for the term “party” (the
noun “party” has 5 senses, as in Fig. 10). By referring to the synsets
and the term’s definitions, the user selects Sense 3 as the concept for
the word “party”.

Sense 1
party, political party -- (an organization to gain political power;
"in 1992 Perot tried to organize a third party at the national level")
   => organization, organisation -- (a group of people who work
      together)
      => social group -- (people sharing some social relation)
         => group, grouping -- (any number of entities (members)
            considered as a unit)

Sense 2
party -- (an occasion on which people can assemble for social
interaction and entertainment; "he planned a party to celebrate
Bastille Day")
   => affair, occasion, social occasion -- (a vaguely specified
      social event; "the party was quite an affair")
      => social event -- (an event characteristic of persons
         forming groups)
         => event -- (something that happens at a given place
            and time)

Fig. 10. WordNet concepts for the word “party”



4.2 Pre-processing Part 

After the user applies the selected concepts to the system, a prototype
ontology is produced. (A) in Fig. 11 shows the initial model of the
taxonomic relationships. Related concept pairs are also extracted by
statistical methods, namely WordSpace and the association rule algorithm,
with default parameters.



4.3 Quality Improvement Part 

After the pre-processing part, there are a prototype taxonomy and
candidate concept pairs for concept specification. However, they have
only been processed automatically, and we need to adjust them to the
actual domain.

(B) in Fig. 11 shows the display of concept drift management. The system
indicates some groups of concepts in the taxonomy so that the user can
decide which parts should be modified.

The related concept pairs may also be re-extracted by changing the
parameters of the statistical methods in an attempt to obtain a suitable
number of concept pairs.

As a result, the user obtained a domain ontology, as shown in Fig. 12.

5 Related Work 

Navigli et al. proposed OntoLearn [17][18], which supports domain
ontology construction by using existing ontologies and natural language
processing techniques. In their approach, existing concepts from WordNet
are enriched and pruned to fit the domain concepts by using NLP (Natural
Language Processing) techniques. They argue that the automatically
constructed ontologies are practically usable in the case study of a
terminology translation application. However, they did not show any
evaluation of the generated ontologies themselves, such as might be done
by domain experts. Although a lot of useful information is in the machine
readable dictionaries and documents of the application domain, some
essential concepts and knowledge are still in the minds of domain
experts. We do not generate the ontologies themselves automatically, but
suggest relevant alternatives to the human experts interactively while
the experts construct domain ontologies. In another case study [19], we
found that even if the concepts are in the MRD (Machine Readable
Dictionary), they are not sufficient for use as they are. In that case
study, some parts of the hierarchical relations were interchanged between
the generic ontology (WordNet) and the domain ontology, which is called
“Concept Drift”. In such cases, presenting an automatically generated
ontology that contains concept drifts may confuse domain experts. We
argue that, at the domain ontology construction phase, the initiative
should be kept not with the machine but in the hands of the domain
experts. This is the difference between our approach and Navigli’s. Our
human-centered approach enabled us to cooperate tightly with human
experts.

Fig. 11. The initial model of the domain taxonomy (A) and the concept drift
management (B)

From the technological viewpoint, there are two different related
research areas. In research using verb-oriented methods, the relation
between a verb and the nouns modified by it is described, and the concept
definition is constructed from this information (e.g. [20]). In [21],
taxonomic relationships and Subcategorization Frames of verbs (SF) are
extracted from technical texts using a machine learning method. The nouns
in two or more different kinds of SF with the same frame-name and
slot-name are gathered as one concept, the base class, and an ontology
with only taxonomic relationships is built by further clustering the base
classes. Moreover, in parallel, the Restriction of Selection (RS), which
is the slot-value in an SF, is replaced with the concept that satisfies
the instantiated SF. However, proper evaluation has not yet been done.
Since an SF represents the syntactic relationships between a verb and
nouns, a step for conversion to non-taxonomic relationships is necessary.

Fig. 12. Constructed CISG ontology

On the other hand, in ontology learning using data-mining methods, the
discovery of non-taxonomic relationships using an association rule
algorithm is proposed in [22]. They extract concept pairs based on the
modification information between terms selected by parsing, and treat the
concept pairs as a transaction. By using heuristics with shallow text
processing, the generated transactions better reflect the syntax of the
texts. Moreover, they propose RLA, their original measure of learning
accuracy for non-taxonomic relationships that uses the existing taxonomic
relations. The concept pair extraction method in our paper does not need
parsing, and it can also capture context similarity between terms that
appear apart from each other in texts or are not mediated by the same
verb.

6 Conclusion and Future Work 

In this paper, we presented a support environment for ontology
construction named DODDLE-R, which aims to become a total support
environment for user-centered on-the-fly ontology construction. Its main
principle is high-level support for users through interaction and low
dependence on automatic machine processing. First, a user identifies the
input concepts by associating WordNet concepts with terms extracted from
a text corpus. Then, the pre-processing part generates the basis of the
ontology in the form of a taxonomy and related concept pairs, by
referring to WordNet as an MRD and to a domain specific text corpus. The
quality improvement part provides management facilities for handling
concept drift in the taxonomy and for identifying significant concept
pairs among the extracted related concept pairs. For these management
tasks, MR³ provides significant visualization support for the user
through graph representation of ontologies. As a case study, we
constructed an ontology in the law domain by exploiting the articles of
the CISG as a domain specific text corpus. Compared with the former
ontology construction study with DODDLE-II, even though the first step,
the input concept selection phase, takes time, the other phases proceeded
fairly well because of the re-organized system architecture and the
improved user interface in conjunction with MR³. Finally, the user
constructed a law domain ontology with the interactive support of
DODDLE-R and produced an OWL-Lite file, which can be made public as a
Semantic Web ontology.

We plan further improvements to make DODDLE-R a more flexible ontology
development environment. At this point, the user interface of DODDLE-R
does not completely support users’ trial and error (in other words, going
forward and coming back to particular phases of ontology construction
seamlessly) in ontology construction. Since we believe that the user
interface is one of the most important facilities of a support tool for
ontology construction, it should be improved to the point of supporting
the user seamlessly. In addition, although DODDLE-R extracts domain
concepts from a text corpus, the extracted terms might be suitable not as
concepts (classes) but as relationships (properties) or instances
(individuals). For example, the term “time” may be a concept or a
property (or another kind of attribute). Because the collaboration with
MR³ realizes total management of OWL classes, properties and instances
(through its editors for each in sub-windows and its meta-model
management facility), DODDLE-R may be able to support the construction of
not only ontologies but also models, which consist of individuals and
their relationships (properties). Furthermore, we plan to implement an
import facility for other statistical methods. Although DODDLE-R does not
emphasize the functions of the pre-processing part, it would be better to
provide such an import facility. For instance, the machine learning
software Weka [23] contains several machine learning algorithms that may
be suitable for extracting related concept pairs from a text corpus. As
for the quality improvement part of DODDLE-R, many additional functions
are conceivable. For instance, for related concept pair extraction by
statistical methods, a line graph window would be suitable for showing
the results of applying the statistical methods and for checking the
current status of recall and precision. Additionally, in terms of
adaptation to Semantic Web standards, import and export support for other
ontology languages, such as DAML+OIL, would be helpful for
interoperability with other ontology tools.
Abstract. Knowledge management solutions relying on central repositories
sometimes have not met expectations, since users often create knowledge ad- 
hoc using their individual vocabulary and using their own decentral IT infras- 
tructure (e.g., their laptop). To improve knowledge management for such decen- 
tralized and individualized knowledge work, it is necessary to, first, provide a 
corresponding IT infrastructure and to, second, deal with the harmonization of 
different vocabularies/ontologies. In this paper, we briefly sketch the technical 
peer-to-peer platform that we have built, but then we focus on the harmonization 
of the participating ontologies. 

Thereby, the objective of this harmonization is to avoid the worst incongruen- 
cies by having users share a core ontology that they can expand for local use at 
their will and individual needs. The task that then needs to be solved is one of 
distributed, loosely-controlled and evolving engineering of ontologies. We have
performed a case study along these lines. To support the ontology engineering process in the
case study we have furthermore extended the existing ontology engineering en- 
vironment, OntoEdit. The case study process and the extended tool are presented 
in this paper. 



1 Introduction 

The knowledge structures underlying today’s knowledge management systems
constitute a kind of ontology that may be built according to established
methodologies, e.g. the one by [1]. These methodologies take a centralized
approach towards engineering knowledge structures, requiring knowledge
engineers, domain experts and others to perform various tasks such as
requirement analysis and interviews. While the user group of such an
ontology may be huge, the development itself is performed by a
comparatively small group of domain experts who represent the user
community and ontology engineers who help with structuring.

In Virtual Organizations [2], organizational structures change very often,
since organizations frequently leave or join a network. Therefore, working
based on traditional, centralized knowledge management systems becomes
infeasible. While there are some technical solutions toward Peer-to-Peer
knowledge management systems (e.g., [3]), and we have developed a
technically sophisticated solution of our own as part of our project SWAP
(Semantic Web and Peer-to-Peer) [4], traditional methodologies for
creating and maintaining knowledge structures appear to become unusable,
like the systems they had been developed for in the first place.

Therefore, we postulate that ontology engineering must take place in a Distributed, 
evolvInG and Loosely-controlled setting. With DILIGENT we here provide a process 
template suitable for distributed engineering of knowledge structures that we plan to 
extend towards a fully worked out and multiply tested methodology in the long run. We 
here show a case study we have performed in the project SWAP using DILIGENT with a 
virtual organization. DILIGENT comprises five main activities of ontology engineering: 
build, local adaptation, analysis, revision, and local update (cf. Section 3). 

The case study (cf. Section 4) suggests that the resulting ontology is indeed shared 
among users, that it adapts fast to new needs and is quickly engineered. With some 
loose control we could ensure that the core ontology remained consistent, though we 
do not claim that it gives a complete view on all the different organizations. 

In the following, we briefly introduce the organizational and technical setting of our 
case study (Section 2). Then we sketch the DILIGENT process template (Section 3), 
before we describe the case study (Section 4). 



2 Problem Setting 

2.1 Organizational Setting at IBIT Case Study 

In the SWAP project, one of the case studies is in the tourism domain of the Balearic
Islands. The needs of the tourism industry there, which accounts for 80% of the islands’
economy, are best described by the term ‘coopetition’. On the one hand, the different
organizations compete against each other for customers. On the other hand, they must
cooperate in order to provide high quality for regional issues such as infrastructure,
facilities, clean environment, or safety, which are critical for them to be able to compete
against other tourism destinations.

To collaborate on regional issues a number of organizations now collect and share 
information about indicators reflecting the impact of growing population and tourist 
fluxes in the islands, their environment and their infrastructures. Moreover, these in- 
dicators can be used to make predictions and help planning. For instance, organiza- 
tions that require Quality & Hospitality management use the information to better plan, 
e.g., their marketing campaigns. As another example, the governmental agency IBIT 3 , 
the Balearic Government’s co-ordination center of telematics, provides the local indus- 
try with information about new technologies that can help the tourism industry to better 
perform their tasks. 

Due to the different working areas and objectives of the collaborating organizations, 
it proved impossible to set up a centralized knowledge management system or even a 
centralized ontology. They asked explicitly for a system without a central server, where
knowledge sharing is integrated into the normal work, but where very different kinds of
information could be shared with others.

3 http://www.ibit.org

To this end the SWAP consortium (including us at Univ. of Karlsruhe, IBIT, Free
Univ. Amsterdam, Meta4, and empolis) has been developing the SWAP generic
platform, and we have built a concrete application on top of it that satisfies the
information sharing needs just elaborated.

2.2 Technical Setting at SWAP 

The SWAP platform (Semantic Web And Peer-to-peer; short Swapster) [4] is a generic 
infrastructure, which was designed to enable knowledge sharing in a distributed net- 
work. Nodes wrap knowledge from their local sources (files, e-mails, etc.). Nodes ask 
for and retrieve knowledge from their peers. For communicating knowledge, Swap- 
ster transmits RDF structures [5], which are used to convey conceptual structures (e.g., 
the definition of what a conference is) as well as corresponding data (e.g., data about 
ESWS-2004). For structured queries as well as for keyword queries, Swapster uses 
SeRQL, an SQL-like query language that allows for queries combining the conceptual 
and the data level and for returning newly constructed RDF-structures. 

In the following we describe only the SWAPSTER components that we refer to later 
in this document (for more see [4]). 

Knowledge Sources: Peers may have local sources of information such as the local file 
system, e-mail directories, local databases or bookmark lists. These local information 
sources represent the peer’s body of knowledge as well as its basic vocabulary. These 
sources of information are the place where a peer can physically store information (doc- 
uments, web pages) to be shared on the network. 

Knowledge Source Integrator: The Knowledge Source Integrator is responsible for 
the extraction and integration of internal and external knowledge sources into the Local 
Node Repository. This task comprises (1) means to access local knowledge sources and 
extract an RDF(S) representation of the stored knowledge, (2) the selection of the RDF 
statements to be integrated into the Local Node Repository and (3) the annotation of the
statements with metadata. These processes utilize the SWAP metadata model presented 
later in this section. 

Local Node Repository: The local node repository stores all information and its meta
information a peer wants to share with remote peers. It allows for query processing and 
view building. The repository is implemented on top of Sesame [6]. 

User Interface: The User Interface of the peer provides individual views on the infor- 
mation available in local sources as well as on information on the network. The views 
can be implemented using different visualization techniques (topic hierarchies, thematic 
maps, etc). The Edit component described here is realized as a plug-in of the OntoEdit 
ontology engineering environment. 

Communication Adapter: This component is responsible for the network communi- 
cation between peers. Our current implementation of the Communication Adapter is 
built on the JXTA framework [7].

Information and Meta-information. Information is represented as RDF(S) statements
in the repository. The SWAP meta model 4 (cf. [4]) provides meta-information about the
statements in the local node repository in order to record where the statements came
from, among other meta-information. The SWAP meta model consists of two RDFS classes,
namely Swabbi and Peer. Every resource is related to an instance of Swabbi in order
to describe, among other things, which instances of Peer it came from.

Besides the SWAP meta data model, the SWAP environment builds on the SWAP
common ontology. 5 The SWAP common model defines concepts such as File and Folder.
The purpose of these classes is to provide a common model for information usually found
on a peer participating in a knowledge management network.

Querying for Data. SeRQL [8] is an SQL-like RDF query language comparable to,
e.g., RQL [9]. The main feature of SeRQL that goes beyond the abilities of existing
languages is the ability to define structured output in terms of an RDF graph that does 
not necessarily coincide with the model that has been queried. This feature is essential 
for defining personalized views in the repository of a SWAP peer. 
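
The idea of a query whose result is a newly constructed graph, rather than a subset of the queried one, can be illustrated independently of SeRQL syntax. The following is a minimal Python sketch under stated assumptions: the repository is modelled as a set of (subject, predicate, object) tuples, and the function construct_view as well as all ex: and view: identifiers are hypothetical and are not part of SeRQL or the SWAP platform.

```python
# Toy repository: RDF statements as (subject, predicate, object) tuples.
# All identifiers below are made up for illustration.
REPOSITORY = {
    ("ex:ESWS2004", "rdf:type", "ex:Conference"),
    ("ex:ESWS2004", "ex:heldIn", "ex:Crete"),
    ("ex:Conference", "rdfs:subClassOf", "ex:Event"),
    ("ex:Crete", "rdf:type", "ex:Island"),
}


def construct_view(triples, instance_class, source_predicate, view_predicate):
    """Build a new graph from the queried one.

    For every instance of `instance_class`, copy its `source_predicate`
    statements, but rename the predicate to `view_predicate`. The result
    therefore uses a vocabulary that need not occur in the queried model.
    """
    instances = {s for (s, p, o) in triples
                 if p == "rdf:type" and o == instance_class}
    return {(s, view_predicate, o) for (s, p, o) in triples
            if s in instances and p == source_predicate}


view = construct_view(REPOSITORY, "ex:Conference", "ex:heldIn", "view:location")
print(sorted(view))
```

The returned set combines the conceptual level (the class membership test) with the data level (the copied statements), which is the property the text attributes to SeRQL views.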

OntoEdit. OntoEdit [10] is an ontology engineering environment which allows for inspecting,
browsing, codifying and modifying ontologies. Modelling ontologies using OntoEdit 
means modelling at a conceptual level, viz. (i) as much as possible independent of a 
concrete representation language, (ii) using graphical user interfaces (GUI) to represent 
views on conceptual structures, i.e. concepts ordered in a concept hierarchy, relations 
with domain and range, instances and axioms, rather than codifying conceptual struc- 
tures in ASCII. 

3 DILIGENT Process 

3.1 Process Overview 

As we have described before, decentralized cases of knowledge sharing, like our ex- 
ample of a virtual organization, require an ontology engineering process that reflects 
this particular organizational setting [11]. 6 Therefore, we have drafted the template of
such a process, although we cannot claim that it is a full-fledged methodology yet. The result,
which we call DILIGENT, is described in the following. In particular, we elaborate on
the high-level process, the dominating roles and the functions of DILIGENT, before we
go through the detailed steps in Section 3.2. Subsequently, we give the concrete case
in Section 4 as an indicator for the validity of our ontology engineering process design.

Key roles: In DILIGENT there are several experts, with different and complementary
skills, involved in collaboratively building the same ontology. In a virtual organization 
they often belong to competing organizations and are geographically dispersed. Ontol- 
ogy builders may or may not use the ontology. Vice versa, most ontology users will 
typically not build or modify the given ontology. 

Overall process: An initial ontology is made available and users are free to use it and
modify it locally for their own purposes. There is a central board that maintains and
assures the quality of the shared core ontology. This central board is also responsible
for deciding to do updates to the core ontology. However, updates are mostly based on
changes re-occurring at and requests by decentrally working users. Therefore the board
only loosely controls the process. Due to the changes introduced by the users over time
and the on-going integration of changes by the board, the ontology evolves. Let us
now survey the DILIGENT process at the next finer level of granularity. DILIGENT
comprises five main steps: (1) build, (2) local adaptation, (3) analysis, (4) revision,
(5) local update (cf. Figure 1).

4 http://swap.semanticweb.org/2003/01/swap-peer#

5 http://swap.semanticweb.org/2003/01/swap-common#

6 In fact, we conjecture that the majority of knowledge sharing cases falls into this category.

Build. The process starts by having domain experts, users, knowledge engineers and 
ontology engineers build an initial ontology. In contrast to existing ontology engineer- 
ing methodologies (cf. [12-16]), we do not require completeness of the initial shared 
ontology with respect to the domain. The team involved in building the initial ontol- 
ogy should be relatively small, in order to more easily find a small and consensual first 
version of the shared ontology. 

Local adaptation. Once the core ontology is available, users work with it and, in par- 
ticular, adapt it to their local needs. Typically, they will have their own business require- 
ments and correspondingly evolve their local ontologies (including the common core) 
[17, 18]. In their local environment, they are also free to change the reused core on- 
tology. However, they are not allowed to directly change the core ontology from which 
other users copy to their local repository. Logging local adaptations (either permanently 
or at control points), the control board collects change requests to the shared ontology. 
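
How change requests could be derived from logged local adaptations can be sketched as a set difference between each local ontology and the shared core. This is only an illustration under stated assumptions: ontologies are modelled as sets of (subject, predicate, object) tuples, and the function name collect_change_requests and all identifiers are hypothetical; the SWAP tooling may log adaptations differently.

```python
def collect_change_requests(core, local_ontologies):
    """Collect, per added statement, the peers whose local ontology contains it.

    core: set of (subject, predicate, object) statements of the shared ontology
    local_ontologies: dict mapping peer name -> set of statements
    The shared core itself is never modified here; users only change their copies.
    """
    requests = {}
    for peer, statements in local_ontologies.items():
        for statement in statements - core:
            requests.setdefault(statement, set()).add(peer)
    return requests


core = {("core:Beach", "rdfs:subClassOf", "core:NaturalSpaces")}
project = ("ex:Project", "rdf:type", "rdfs:Class")
local_ontologies = {
    "peer1": core | {project},
    "peer2": core | {project, ("peer2:Water", "rdfs:subClassOf", "core:Indicator")},
    "peer3": set(core),
}
requests = collect_change_requests(core, local_ontologies)
print(sorted(requests[project]))
```

Statements requested by several peers are natural candidates for the board to consider first.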




Fig. 1. Roles and functions in distributed ontology engineering

Analysis. The board analyzes the local ontologies and the requests and tries to identify
similarities in users’ ontologies. Since not all of the changes introduced or requested by
the users will be introduced to the shared core ontology, 7 a crucial activity of the board
is deciding which changes are going to be introduced in the next version of the shared
ontology. The input from users provides the necessary arguments to underline change
requests. A balanced decision that takes into account the different needs of the users
and meets users’ evolving requirements 8 has to be found.

7 The idea in this kind of development is not to merge all user ontologies.

Revision. The board should regularly revise the shared ontology, so that local ontologies
do not diverge too far from the shared ontology. Therefore, the board should have a
well-balanced and representative participation of the different kinds of participants involved
in the process: knowledge providers, domain experts, ontology engineers and users. In
this case, users are involved in ontology development, at least through their requests
and re-occurring improvements and by evaluating it, mostly from a usability point of
view. Knowledge providers in the board are responsible for evaluating the ontology,
mostly from a technical and domain point of view. Ontology engineers are one of the 
major players in the analysis of arguments and in balancing them from a technical 
point of view. Another possible task for the controlling board, that may not always be 
a requirement, is to assure some compatibility with previous versions. Revision can be 
regarded as a kind of ontology development guided by a carefully balanced subset of 
evolving user driven requirements. Ontology engineers are responsible for updating the 
ontology, based on the decisions of the board. Revision of the shared ontology entails 
its evolution. 

Local update. Once a new version of the shared ontology is released, users can update 
their own local ontologies to better use the knowledge represented in the new version. 
Even if the differences are small, users may prefer to reuse, e.g., the new concepts instead
of their previously locally defined concepts that correspond to them.

3.2 Tool Support for DILIGENT Steps 

We support the participants in the DILIGENT process with a tool (cf. Figure 2). It is
an implementation of the Edit component of the SWAP environment, thus it works on 
the information stored in the local node repository, and is realized as an OntoEdit plug- 
in. We will now describe in detail how the tool supports the actions building, locally 
adapting, analyzing, revising and locally updating. 

Build 

The first step of the ontology engineering task is covered by established methodologies 
and by common OntoEdit functions. Some major tool functionality includes support 
for knowledge elicitation from domain experts by means of competency questions and 
mind maps and further support for the refinement process. 

In contrast to a common full ontology engineering cycle the objective of this Build 
task is not to generate a complete and evaluated ontology but rather to quickly identify 
and formalize the main concepts and main relations. 



8 This is actually one of the trends in modern software engineering methodologies (see Rational
Unified Process).

Fig. 2. OntoEdit plug-in to support DILIGENT



Local Adaptation 

We distinguish two main types of users. The less frequent type is the user with on- 
tology engineering competence who analyzes his personal needs, conceptualizes and 
formalizes them. He uses established ontological guidelines [19] in order to maintain 
soundness and validity. Besides, he annotates his knowledge according to his locally 
extended ontology. 

The more common type of user reuses the categorizations he had defined in his daily 
work before (e.g. his folder structures) and just aligns them with the shared ontology. To
illustrate this use case we must point forward to some issues we found in the case study. 
In the case study, users expect from a peer-to-peer system primarily the possibility to 
share their documents with others. Users already organize their files in folder structures 
according to their individual views. Hence, they will extend the core ontology with 
concepts and relations corresponding to folder structures found in their file or email 
system. 

Concept creation. Our tool supports the creation of concepts and thus the extension of 
the shared ontology in two ways. The reader may note that both methods have been 
heavily influenced by our targeted system, SWAPSTER, and may be supplemented or 
overridden by other methods for other target systems: 

1. OntoScrape, part of the SWAPSTER knowledge source integrator, can extract
information from the user’s local file and email system. OntoScrape extracts, e.g., the
folder hierarchy and builds up an RDFS representation in which the folder names
are used to create instances of class Folder. This information is stored in the local
node repository. Then, the user can pick a set of instances of Folder and create
concepts or relations using the folder names. In case of “concept creation” he would
select a certain concept and the system would subclass that concept using the names
of the previously selected folders.

The user may also reuse the folder hierarchy given by the inFolder relation to
construct a subClassOf hierarchy.

2. Furthermore, a user can query other participants for their local subconcepts of the 
core ontology. He can use the gathered information to directly extend his own structures
by integrating retrieved information. Alternatively, he may use the query result
only for inspiration and create his own extensions and modifications.

SWAPSTER integrates a component for semi-automatic alignment. Alignment detection
is based on similarities between concepts and relations (cf., e.g., [20]). The
user may either select a set of classes and ask for proposed alignment for these 
classes, or he can look for alignments for the entire class hierarchy. The reader may 
note that even the best available alignment methods are not very accurate and hence 
some user involvement is required for aligning ontologies. 
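
The first concept creation method above, turning selected folders into subconcepts while reusing the inFolder relation as a subClassOf hierarchy, can be sketched as follows. This is a minimal illustration and not the OntoScrape implementation: the function folders_to_subclasses, the dictionary encoding of inFolder, and the peer1: and core: identifiers are all assumptions made for the example.

```python
def folders_to_subclasses(in_folder, selected_folders, parent_concept, namespace):
    """Create subClassOf statements from a selection of folders.

    in_folder: dict mapping folder name -> parent folder name (None for a root)
    selected_folders: folder names the user picked
    parent_concept: concept of the shared ontology chosen by the user
    namespace: the peer's own name space, used to mint new concept URIs

    A folder whose parent folder is also selected becomes a subclass of the
    parent's concept; any other selected folder is placed directly under
    `parent_concept`.
    """
    selected = set(selected_folders)
    statements = set()
    for folder in selected:
        parent = in_folder.get(folder)
        if parent in selected:
            target = namespace + parent
        else:
            target = parent_concept
        statements.add((namespace + folder, "rdfs:subClassOf", target))
    return statements


in_folder = {"Indicators": None, "Water": "Indicators",
             "Energy": "Indicators", "Private": None}
result = folders_to_subclasses(
    in_folder, ["Indicators", "Water", "Energy"], "core:Indicator", "peer1:")
for statement in sorted(result):
    print(statement)
```

Folders that were not selected (here Private) contribute nothing, which mirrors the user's freedom to expose only part of his folder structure.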

We are well aware of the drawbacks of this approach since the created structures will 
not be “clean” ontologies. However, as our case study indicates, the created structures
are good enough to be a fair input for the revision phase. 

Instance assignment. Besides instances of the created concepts, the user mainly has
instances of concept Source (e.g. Folder and File) and wants to relate them to his concepts.
In particular, documents play a predominant role in our case study. Since the global
ontology certainly differs from existing local structures, we face the typical bootstrapping
problem that the documents need to be aligned with the defined concepts. Our tool
offers two possibilities to facilitate the assignment of documents to classes.

Manual Assignment: Instances of concept Source can be selected manually and assigned
to any concept in the ontology.

Automatic Assignment: Automatic text classification is nowadays very effective.
Hence we provide an interface for classifiers to suggest document classifications. 
Classifier training can take place remotely for the core ontology or according to es- 
tablished procedures [21]. The classifier has to produce a set of RDFS statements, 
stating which files should be classified where in the concept hierarchy. This has not 
been implemented yet. 
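
Since the classifier interface is described only by its output (a set of RDFS statements placing files in the concept hierarchy), the following sketch shows one possible shape of such an interface. It is purely illustrative: simple keyword matching stands in for a trained text classifier, and the function suggest_classifications, the use of rdf:type for the assignment, and all file: and core: identifiers are assumptions.

```python
def suggest_classifications(documents, concept_keywords, min_hits=1):
    """Suggest (file, rdf:type, concept) statements for documents.

    documents: dict mapping file URI -> document text
    concept_keywords: dict mapping concept URI -> list of indicative keywords
    min_hits: minimum number of distinct keywords that must occur in the text

    Keyword matching is a stand-in for a real classifier; only the form of
    the output (a set of statements) is the point of this sketch.
    """
    suggestions = set()
    for file_uri, text in documents.items():
        words = set(text.lower().split())
        for concept, keywords in concept_keywords.items():
            hits = sum(1 for keyword in keywords if keyword.lower() in words)
            if hits >= min_hits:
                suggestions.add((file_uri, "rdf:type", concept))
    return suggestions


documents = {
    "file:report.txt": "Water consumption indicator for 2002",
    "file:memo.txt": "Lunch menu",
}
concept_keywords = {
    "core:Indicator": ["indicator", "indicators"],
    "core:Beach": ["beach", "beaches"],
}
suggestions = suggest_classifications(documents, concept_keywords)
print(sorted(suggestions))
```

Because the result is a set of suggestions rather than direct changes to the repository, the user can still confirm or reject each proposed assignment.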

Analyzing 

As described in the methodology, the board will come together at fixed intervals or
when a certain threshold of change requests has been reached. They will subsequently 
analyze the activities which have taken place. They will gather the ontologies from all 
participating peers on one central peer. The main task of the board is to incorporate the 
change requests into the core ontology and to identify common usage patterns. Our tool 
supports the board members in different ways to fulfill their task. 

View selection. The number of newly created concepts within the peer network can be 
large. The board members can use queries to select only parts of the ontology to be visualized.
Instead of loading the entire local node repository, a SeRQL query can be used
to generate a view on the repository. Queries can be defined manually, or predefined
ones — visualizing certain branches of the ontology — can be selected. 

Colors. The board needs to separate extensions made by different users and is interested 
in their relative activity. Since each peer uses its own name space to create URIs, exten- 
sions to the core made by different peers can be distinguished. The tool highlights the 
concepts, relations and instances of different peers by changing their background color. 
The saturation and brightness of the color indicate the number of concepts coming
from a particular peer. 9 White is reserved for name spaces which the users can choose
not to highlight (e.g. the local, swap-peer and swap-common name spaces are excluded
from highlighting by default).
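
The colour coding can be sketched as a mapping from name spaces to background colours. The sketch below is one possible realization, not the plug-in's actual scheme: the hue assignment, the numeric ranges for saturation and brightness, and the function peer_colors are assumptions. It follows the stated rule that more concepts means a darker and more saturated colour, and that excluded name spaces stay white.

```python
import colorsys


def peer_colors(concept_counts, excluded=("local", "swap-peer", "swap-common")):
    """Map each name space to an (r, g, b) background colour with values in [0, 1].

    concept_counts: dict mapping name space -> number of concepts it contributed
    excluded: name spaces that are not highlighted and therefore stay white
    """
    active = sorted(ns for ns in concept_counts if ns not in excluded)
    max_count = max((concept_counts[ns] for ns in active), default=0)
    colors = {ns: (1.0, 1.0, 1.0) for ns in concept_counts if ns in excluded}
    for index, ns in enumerate(active):
        hue = index / len(active)            # one distinct hue per peer
        share = concept_counts[ns] / max_count if max_count else 0.0
        saturation = 0.2 + 0.8 * share       # more concepts -> more saturated
        value = 1.0 - 0.5 * share            # more concepts -> darker
        colors[ns] = colorsys.hsv_to_rgb(hue, saturation, value)
    return colors


colors = peer_colors({"peer1": 10, "peer2": 2, "swap-common": 7})
for namespace, rgb in sorted(colors.items()):
    print(namespace, tuple(round(channel, 2) for channel in rgb))
```

Distinguishing peers by hue is possible here only because each peer creates URIs in its own name space, as noted above.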

Adaptation rate. The averaged adaptation rate 10 of concepts from the core ontology and 
also of concepts from different users is an indicator of how well a concept fits the user 
needs. If a concept of the core ontology was not accepted by the users it probably has 
to be changed. Alternatively, a concept introduced by a user which has been reused by 
many other users can easily be integrated into the core ontology. The adaptation rate is
calculated from the information stored in the SWAP data model and visualized as a tool
tip. In our case study, e.g., the concept beaches was adapted by all users.
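
The adaptation rate defined in footnote 10, i.e. the number of participants who have locally included a concept divided by the total number of participants, can be computed as in the following sketch. The data layout (a dictionary from peer to the set of concepts in its local ontology) and all identifiers are assumptions made for illustration.

```python
def adaptation_rate(concept, local_ontologies):
    """Fraction of participants whose local ontology includes `concept`.

    local_ontologies: dict mapping peer name -> set of concept URIs
    Returns 0.0 when there are no participants.
    """
    if not local_ontologies:
        return 0.0
    adopters = sum(1 for concepts in local_ontologies.values()
                   if concept in concepts)
    return adopters / len(local_ontologies)


local_ontologies = {
    "peer1": {"core:Beach", "core:Indicator"},
    "peer2": {"core:Beach"},
    "peer3": {"core:Beach", "peer3:Project"},
    "peer4": {"core:Beach", "peer3:Project"},
}
for concept in ("core:Beach", "peer3:Project", "core:Indicator"):
    print(concept, adaptation_rate(concept, local_ontologies))
```

In this toy data the concept core:Beach is included by every peer, matching the situation reported for beaches in the case study, while a user-defined concept reused by half of the peers would be a candidate for the core ontology.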

Visualizing alignments. Instead of reusing concepts from other users, a user can align his own concepts with
them. The semantics of both actions is very similar. However, alignment implies, in 
most cases, a different label for the concept, which is determined by the board. 

Sorting. To facilitate the analysis process, concepts, relations and instances may be 
sorted alphabetically, according to their adaptation rate or the peer activity. Concepts 
with the same label, but from different peers can be identified. Equally, the concepts
reused by most peers may be recognized. 
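
Identifying concepts that carry the same label but come from different peers amounts to grouping concept URIs by label. A minimal sketch follows, assuming URIs of the form namespace:label and case-insensitive label comparison; both assumptions, as well as the function names, are ours and not prescribed by the tool.

```python
from collections import defaultdict


def split_uri(uri):
    """Split 'namespace:label' into (namespace, label)."""
    namespace, _, label = uri.partition(":")
    return namespace, label


def shared_labels(concept_uris):
    """Return labels used by more than one peer, with the peers using them."""
    peers_by_label = defaultdict(set)
    for uri in concept_uris:
        namespace, label = split_uri(uri)
        peers_by_label[label.lower()].add(namespace)
    return {label: peers for label, peers in peers_by_label.items()
            if len(peers) > 1}


concepts = ["peer1:Project", "peer2:project", "peer3:Beach", "peer1:Water"]
duplicates = shared_labels(concepts)
for label in sorted(duplicates):
    print(label, sorted(duplicates[label]))
```

Sorting the resulting labels alphabetically, or by the number of peers using them, gives the board a quick overview of candidate concepts for the core ontology.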

Revision 

The analysis is followed by the revision of the core ontology. The change requests as 
well as the recognized common usage patterns are integrated. In a traditional scenario 
the knowledge engineer introduces the new concepts and relations or changes the existing
ones while the system meets the requirements described in [18]. The ontology
changes must be resolved such that the consistency of the underlying
ontology and all dependent artifacts is preserved and may be supervised.

Additionally, we require that the reasons for any change do not demand too much
effort from the individual user. In particular, changes to the core ontology made because 
of overarching commonalities should be easy to integrate for users who created the 
concepts in the first place. 



Local Update 



The changes to the core ontology must be propagated to all peers afterwards. The list of
changes is transmitted to the different peers by the Advertisement component. Maedche
et al. describe in [22] the necessary infrastructure to enable consistent change propagation
in a distributed environment. We do not require that all users adapt their ontology
to the changes introduced by the board members. Furthermore, we allow them to use
different evolution strategies when they accept changes (see [18] for an overview of
different strategies).

9 Brighter and less saturated means fewer concepts than darker and more saturated.

10 The adaptation rate of a concept indicates how many users have included the concept into their
local ontology: $\text{adaptation rate} := \frac{\text{No. of participants who have locally included the concept}}{\text{No. of participants}}$

After the local update has taken place, the iteration continues with local adaptation.
During the next analysis step the board will review which changes were actually accepted
by the users.

4 Case Study 

We are now going to describe how DILIGENT ontology engineering is taking place in 
the IBIT case study and how OntoEdit is supporting it. 

In the case study one organization with seven peers took part. The case study lasted 
for two weeks. The case study will be extended in the future to four organizations 
corresponding to 21 peers and it is expected that the total number of organizations will 
grow to 7 corresponding to 28 peers. 

Building. In the IBIT case study two knowledge engineers were involved in building 
the first version of the shared ontology with the help of two ontology engineers. In 
this case, the knowledge engineers were at the same time also knowledge providers. In
addition, they received training such that later, when the P2P network is
up and running on a bigger scale, they will be able to act as ontology engineers
on the board. This they did already during this study, together with two experts
from the domain area.

The ontology engineering process started by identifying the main concepts of the 
ontology through the analysis of competency questions and their answers. The most 
frequent queries and answers exchanged by peers were analyzed. The identified concepts
were divided into three main modules: “Sustainable Development Indicators”,
“New Technologies” and “Quality & Hospitality Management”. From the competency
questions we quickly derived a first ontology with 22 concepts and 7 relations
for the “Sustainable Development Indicator” ontology. This was the domain of the
then participating organizations. The other modules will be further elaborated in future 
efforts. 

Based on previous experience of IBIT with the participants we could expect that 
users would mainly specialize the modules of the shared ontology corresponding to 
their domain of expertise and work. Thus, it was decided by the ontology engineers and 
knowledge providers involved in building the initial version that the shared ontology 
should only evolve by addition of new concepts, and not from other more sophisticated 
operations, such as restructuring or deletion of concepts. 

Local Adaptation. The developed core ontology for “Sustainable Development
Indicator” was distributed among the users and they were asked to extend it with their
local structures. With assistance of the developers they extracted on average 14 folders. 
The users mainly created sub concepts of concepts in the core ontology from the folder 
names. In other cases they created their own concept hierarchy from their folder structure
and aligned it with the core ontology. They did not create new relations. Instance
assignment took place, but was not significant. We omitted the use of the automatic
functions to get a better grasp of the actions the users did manually. 

Analyzing. The members of the board gathered the evolving structures and analyzed 
them with help of the OntoEdit plug-in. The following observations were made: 

Concepts matched: A third of the extracted folder names was directly aligned with the
core ontology. A further tenth of them was used to extend existing concepts.

Folder names indicate relations: In the core ontology a relation inYear between the
concept Indicator and Temporal was defined. This kind of relation is often encoded
in one folder name, e.g. the folder name “SustInd2002” matches the concepts
Sustainable Indicator and Year. 11 It also points to a modelling problem, since
Sustainable Indicator is a concept while “2002” is an instance of concept Year.

Missing top level concepts: The concept project was introduced by more than half of
the participants, but was not part of the initial shared ontology.

Refinement of concepts: The top level concept Indicator was extended by more than
half of the participants, while other concepts were not extended.

Concepts were not used: Some of the originally defined concepts were never used. We
identified concepts as used when the users created instances or aligned documents
with them. A further indicator of usage was the creation of sub concepts.

Folder names represent instances: The users who defined the concept project used
some of their folder names to create instances of that concept, e.g. “Sustainable
indicators project”.

Different labels: The originally introduced concept Natural spaces was often aligned
with a newly created concept Natural environments and never used itself.

Ontology did not fit: One user did create his own hierarchy and could use only one of
the predefined concepts. Indeed his working area was forgotten in the first ontology
building workshop.
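
The usage criterion applied in the observations above (a concept counts as used when users created instances of it, aligned documents with it, or created sub concepts) can be sketched as a simple check over the locally added statements. The sketch assumes that document assignment is expressed with rdf:type, that alignment between concepts uses a separate, hypothetical predicate ex:alignedWith which does not count as usage, and that only user-added statements are passed in; none of this is confirmed by the tool description.

```python
# Predicates that count as "usage" of a core concept in this sketch.
USAGE_PREDICATES = {"rdf:type", "rdfs:subClassOf"}


def unused_concepts(core_concepts, local_statements):
    """Return the core concepts that no local statement instantiates or subclasses.

    local_statements must contain only statements added by users, not the
    core ontology itself; otherwise subclass links inside the core would be
    counted as usage.
    """
    used = {o for (s, p, o) in local_statements
            if p in USAGE_PREDICATES and o in core_concepts}
    return set(core_concepts) - used


core_concepts = {"core:Indicator", "core:NaturalSpaces", "core:Beach"}
local_statements = {
    ("peer1:Water", "rdfs:subClassOf", "core:Indicator"),
    ("file:report.pdf", "rdf:type", "core:Beach"),
    ("peer2:NaturalEnvironments", "ex:alignedWith", "core:NaturalSpaces"),
}
unused = unused_concepts(core_concepts, local_statements)
print(sorted(unused))
```

In this toy data core:NaturalSpaces is aligned with a user concept but never used itself, analogous to the Natural spaces observation reported above.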

From the discussions with the domain experts we have the impression that the local 
extensions are a good indicator for the evolution direction of the core ontology. How- 
ever, since the users made use of the possibility to extend the core ontology with their 
folder names, as we expected, the resulting local ontologies represent the subjects of 
the organized documents. Therefore, a knowledge engineer is still needed to extend the 
core ontology, but the basis of his work is being improved significantly. From our point 
of view there is only a limited potential to automate this process. 

Revision. The board extended the core ontology where it was necessary and performed
some renaming. More specifically, the board introduced (1) one top level concept
(Project) and (2) four sub concepts of the top level concept Indicator and one for the
concept Document. The users were further pointed to the possibility to create instances 
of the introduced concepts. E.g. some folder names specified project names, thus could 
be enriched by such an annotation. 

Local update. The extensions to the core ontology were distributed to the users. The
feedback of the users was generally positive. However, due to the early development
stage of the SWAP environment, a prolonged evaluation of the user behavior
and a second cycle in the ontology engineering process have not yet been performed.

11 Year is a subclass of class Temporal.







5 Lessons Learned 

The case study helped us to generally better comprehend the use of ontologies in a 
peer-to-peer environment. First of all, our users understood the ontology mainly as a
classification hierarchy for their documents. Hence, they did not create instances of the 
defined concepts. However, our expectation that folder structures can serve as a good 
input for an ontology engineer to build an ontology was met. 

Currently we doubt that our manual approach to analyzing local structures will scale 
to cases with many more users. Therefore, we look into technical support to recognize 
similarities in user behavior. Furthermore, the local update will be a problem when 
changes happen more often. Last, but not least, we have so far only addressed the ontology
creation task itself; we have not yet measured whether users get better and faster
responses with the help of DILIGENT-engineered ontologies. All this remains work to
be done in the future.

In spite of the technical challenges, user feedback was very positive since (i) the 
tool was integrated into their daily work environment and could be easily used and (ii) 
the tool provided very beneficial support to perform their tasks. 



6 Related Work 

An extensive state-of-the-art overview of methodologies for ontology engineering can 
be found in [14]. We here briefly present some of the most well-known ontology
engineering methodologies. 

CommonKADS [1] is not per se a methodology for ontology development. It covers
aspects from corporate knowledge management, through knowledge analysis and engineering,
to the design and implementation of knowledge-intensive information systems.
CommonKADS has a focus on the initial phases for developing knowledge management
applications; one can therefore make use of CommonKADS, e.g., for early feasibility
stages.

Methontology [14] is a methodology for building ontologies either from scratch, 
reusing other ontologies as they are, or by a process of re-engineering them. The framework
consists of: identification of the ontology development process where the main
activities are identified (evaluation, configuration, management, conceptualization,
integration, implementation, etc.); a lifecycle based on evolving prototypes; and the
methodology itself, which specifies the steps to be taken to perform each activity, the techniques
used, the products to be output and how they are to be evaluated. 

Even though Methontology already mentions evolving prototypes, none of these 
(and similar other) methodologies responds to the requirements for distributed, loosely
controlled and dynamic ontology engineering. 

There exists a plethora of ‘ontology editors’. We briefly compare two of the most
well-known ones to OntoEdit, viz. Protégé and WebODE. The design of Protégé [23] is
very similar to OntoEdit since it actually was the first editor with an extensible plug-in 
structure and it also relies on the frame paradigm for modelling. Numerous plug-ins 
from external developers exist. WebODE [24] is an ontology engineering workbench
that provides various services for ontology engineering. Similar to OntoEdit, it is
accompanied by a sophisticated methodology of ontology engineering, namely Methontology
(see above). However, neither of these tools is so far known to support distributed, loosely
controlled and evolving ontology engineering such as we have presented for OntoEdit.

There are a number of technical solutions that tackle problems of remote
collaboration, e.g. ontology editing with mutual exclusion [25, 26], inconsistency
detection with a voting mechanism [27], or evolution of ontologies by different means
[17, 18, 22]. APECKS [28] allows users to discuss different modelling decisions
online. All these solutions address the issue of keeping an ontology consistent. None
of them supports (or intends to support) the work process of the ontology engineers
by way of a methodology.

The development of the National Cancer Institute Thesaurus [29] could be an in- 
teresting application scenario for DILIGENT, because their processes seem to follow 
our process templates. However, they focus on the creation of the thesaurus itself rather 
than on a generalizable methodology. 



7 Conclusion 

It is now widely agreed that ontologies are a core enabler for the Semantic Web vision.
The development of ontologies in centralized settings is well studied, and established
methodologies exist. However, current experience from projects suggests that ontology
engineering should be subject to continuous improvement rather than a one-time action,
and that ontologies promise the most benefits in decentralized rather than centralized
systems. Hence, a methodology for distributed, loosely controlled and dynamic ontology
engineering settings is needed. The current version of DILIGENT is a step towards
such a methodology.

DILIGENT comprises the steps Build, Local Adaptation, Analysis, Revision and 
Local Update and introduces a board to supervise changes to a shared core ontology. 
The DILIGENT methodology is supported by an OntoEdit plug-in, which is an im- 
plementation of the Edit component in the SWAP system. The plug-in supports the 
board mainly in recognizing changes to the core ontology by different users during the 
analysis and revision steps and highlights commonalities. It thus supports the user in 
extending and changing the core. 

We have applied the methodology with good results in a case study at IBIT, one of
the partners of the SWAP project. We found that the local extensions are very
document-centered. Though we are aware that this may often lead to unclean ontologies,
we believe it to be one of many important steps towards creating a practical Semantic
Web in the near future.

Acknowledgements. Research reported in this paper has been partially financed by
the EU in the IST project SWAP (IST-2001-34103), the IST thematic network OntoWeb
(IST-2000-29243), the IST project SEKT (IST-2003-506826) and Fundação Calouste
Gulbenkian (21-63057-B). In particular we want to thank Immaculada Salamanca and
Esteve Llado Marti from IBIT for the fruitful discussions and the other people in the
SWAP team for their collaboration towards SWAPSTER.







References 

1. Schreiber, G., et al.: Knowledge Engineering and Management: The CommonKADS
Methodology. The MIT Press, Cambridge, Massachusetts; London, England (1999)

2. Camarinha-Matos, L.M., Afsarmanesh, H., eds.: Processes and Foundations for Virtual
Organizations. Volume 262 of IFIP INTERNATIONAL FEDERATION FOR INFORMATION
PROCESSING. Kluwer Academic Publishers (2003)

3. Bonifacio, M., Bouquet, P., Mameli, G., Nori, M.: Peer-mediated distributed knowledge
management. [30] To appear 2003.

4. Ehrig, M., Haase, P., van Harmelen, F., Siebes, R., Staab, S., Stuckenschmidt, H., Studer, R., 
Tempich, C.: The swap data and metadata model for semantics-based peer-to-peer systems. 
In: Proceedings of MATES-2003. First German Conference on Multiagent Technologies. 
LNAI, Erfurt, Germany, Springer (2003) 

5. Klyne, G., Carroll, J.J.: Resource Description Framework (RDF): Concepts and abstract 
syntax. http://www.w3.org/TR/rdf-concepts/ (2003) 

6. Broekstra, J., Kampman, A., van Harmelen, F.: Sesame: A generic architecture for storing 
and querying RDF and RDFSchema. [31] 54-68 

7. Gong, L.: Project JXTA: A technology overview. Technical report, Sun Microsystems Inc. (2001)

8. Broekstra, J.: SeRQL: Sesame RDF query language. In Ehrig, M., et al., eds.: SWAP Deliv- 
erable 3.2 Method Design. (2003) 55-68 

9. Karvounarakis, G., et al.: Querying RDF descriptions for community web portals. In: Pro- 
ceedings of The French National Conference on Databases 2001 (BDA'01), Agadir, Maroc 
(2001) 133-144 

10. Sure, Y., Angele, J., Staab, S.: OntoEdit: Multifaceted inferencing for ontology engineering. 
Journal on Data Semantics, LNCS 2800 (2003) 128-152 

11. Pinto, H.S., Martins, J.: Evolving Ontologies in Distributed and Dynamic Settings. In Fensel,
D., Giunchiglia, F., McGuinness, D., Williams, M., eds.: Proc. of the 8th Int. Conf. on
Principles of Knowledge Representation and Reasoning (KR2002), San Francisco, Morgan
Kaufmann (2002) 365-374

12. Staab, S., Schnurr, H.P., Studer, R., Sure, Y.: Knowledge processes and ontologies. IEEE 
Intelligent Systems 16 (2001) Special Issue on Knowledge Management. 

13. Gangemi, A., Pisanelli, D., Steve, G.: Ontology integration: Experiences with medical ter- 
minologies. In Guarino, N., ed.: Formal Ontology in Information Systems, Amsterdam, IOS 
Press (1998) 163-178 

14. Gómez-Pérez, A., Fernández-López, M., Corcho, O.: Ontological Engineering. Advanced
Information and Knowledge Processing. Springer (2003)

15. Pinto, H.S., Martins, J.: A Methodology for Ontology Integration. In: Proc. of the First Int. 
Conf. on Knowledge Capture (K-CAP2001), New York, ACM Press (2001) 131-138 

16. Uschold, M., King, M.: Towards a methodology for building ontologies. In: Proc. of
IJCAI95’s WS on Basic Ontological Issues in Knowledge Sharing, Montreal, Canada (1995)

17. Noy, N., Klein, M.: Ontology evolution: Not the same as schema evolution. Knowledge and 
Information Systems (2003) 

18. Stojanovic, L., et al.: User-driven ontology evolution management. In: Proc. of the 13th 
Europ. Conf. on Knowledge Eng. and Knowledge Man. EKAW, Madrid, Spain (2002) 

19. Guarino, N., Welty, C.: Evaluating ontological decisions with OntoClean. Communications 
of the ACM 45 (2002) 61-65 

20. Noy, N., Musen, M.: The PROMPT suite: Interactive tools for ontology merging and map- 
ping. Technical report, SMI, Stanford University, CA, USA (2002) 

21. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys
34 (2002) 1-47







22. Maedche, A., Motik, B., Stojanovic, L.: Managing multiple and distributed ontologies on 
the semantic web. The VLDB Journal 12 (2003) 286-302 

23. Noy, N., Fergerson, R., Musen, M.: The knowledge model of Protégé-2000: Combining
interoperability and flexibility. In Dieng, R., Corby, O., eds.: Proc. of the 12th Int. Conf. on
Knowledge Eng. and Knowledge Man.: Methods, Models, and Tools (EKAW 2000). Volume
1937 of LNAI., Juan-les-Pins, France, Springer (2000) 17-32

24. Arpírez, J.C., et al.: WebODE: a scalable workbench for ontological engineering. In:
Proceedings of the First Int. Conf. on Knowledge Capture (K-CAP) Oct. 21-23, 2001, Victoria,
B.C., Canada. (2001)

25. Farquhar, A., et al.: The ontolingua server: A tool for collaborative ontology construction. 
Technical report KSL 96-26, Stanford (1996) 

26. Sure, Y., Erdmann, M., Angele, J., Staab, S., Studer, R., Wenke, D.: OntoEdit: Collaborative
ontology development for the semantic web. [31] 221-235

27. Pease, A., Li, J.: Agent-mediated knowledge engineering collaboration. [30] 405-415

28. Tennison, J., Shadbolt, N.R.: APECKS: a Tool to Support Living Ontologies. In Gaines,
B., Musen, M., eds.: 11th Knowledge Acquisition for Knowledge-Based Systems Workshop
(KAW98). (1998) 1-20

29. Golbeck, J., Fragoso, G., Hartel, F., Hendler, J., Parsia, B., Oberthaler, J.: The national cancer 
institute’s thesaurus and ontology. Journal of Web Semantics 1 (2003) 

30. van Elst, L., et al., eds. LNAI. Springer, Berlin (2003) 

31. Horrocks, I., Hendler, J., eds.: Proc. of the 1st Int. Semantic
Web Conf. (ISWC 2002). Volume 2342 of LNCS., Sardinia, IT, Springer (2002)




A Protégé Plug-In for Ontology Extraction from Text
Based on Linguistic Analysis



Paul Buitelaar 1, Daniel Olejnik 1, Michael Sintek 2

1 DFKI GmbH, Language Technology, Stuhlsatzenhausweg 3,
66123 Saarbruecken, Germany
{paulb, oleinik}@dfki.de

2 DFKI GmbH, Knowledge Management, Erwin-Schrödinger-Straße,
67608 Kaiserslautern, Germany
sintek@dfki.de



Abstract. In this paper we describe a plug-in (OntoLT) for the widely used
Protégé ontology development tool that supports the interactive extraction
and/or extension of ontologies from text. The OntoLT approach provides an
environment for the integration of linguistic analysis in ontology engineering
through the definition of mapping rules that map linguistic entities in annotated
text collections to concept and attribute candidates (i.e. Protégé classes and
slots). The paper explains this approach in more detail and discusses some initial
experiments on deriving a shallow ontology for the neurology domain from
a corresponding collection of neurological scientific abstracts.



1 Introduction 

With a recent increase in developments towards knowledge-based applications such
as Intelligent Question-Answering, Semantic Web Services and Semantic-Level
Multimedia Search, the interest in large-scale ontologies has increased. Additionally,
as ontologies are domain descriptions that tend to evolve rapidly over time and
between different applications (see e.g. Noy and Klein, 2002), there has been
increasing work in recent years on learning or adapting ontologies dynamically,
e.g. by analysis of a corresponding knowledge base (Deitel et al., 2001, Suryanto and
Compton, 2001) or document collection.

Most of the work in ontology learning has been directed towards learning ontolo- 
gies from text 1 . As human language is a primary mode of knowledge transfer, ontol- 
ogy learning from relevant text collections seems indeed a viable option as illustrated 
by a number of systems that are based on this principle, e.g. ASIUM (Faure et al., 
1998), TextToOnto (Maedche and Staab, 2000; Maedche) and Ontolearn (Navigli et 
al., 2003). All of these combine a certain level of linguistic analysis with machine 



1 See for instance the overview of ontology learning systems and approaches in OntoWeb
deliverable 1.5 (Gómez-Pérez et al., 2003).
learning algorithms to find potentially interesting concepts and relations between
them (see also Maedche, 2003). 

A typical approach in ontology learning from text first involves term extraction
from a domain-specific corpus through a statistical process that determines the
relevance of each term for the domain corpus at hand. The terms are then clustered
into groups with the purpose of identifying a taxonomy of potential classes.
Subsequently, relations can also be identified by computing a statistical measure of
‘connectedness’ between identified clusters.
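
As a rough illustration of the first step, term relevance can be approximated by
comparing relative frequencies in the domain corpus with those in a general reference
corpus. The sketch below is not the measure used by any of the systems cited above;
the function name `relevance`, the ad hoc add-one smoothing and the toy token lists
are assumptions made purely for illustration.

```python
from collections import Counter

def relevance(domain_tokens, reference_tokens):
    """Score each domain term by its relative frequency in the domain corpus
    divided by its (smoothed) relative frequency in a reference corpus."""
    dom = Counter(domain_tokens)
    ref = Counter(reference_tokens)
    n_dom = len(domain_tokens)
    n_ref = len(reference_tokens)
    scores = {}
    for term, count in dom.items():
        p_dom = count / n_dom
        # Ad hoc add-one smoothing so unseen reference terms do not divide by zero.
        p_ref = (ref[term] + 1) / (n_ref + 1)
        scores[term] = p_dom / p_ref
    return scores

# Toy corpora (hypothetical): a medical snippet versus general text.
domain = ["ligament", "knee", "the", "ligament"]
reference = ["the", "house", "the", "the"]

scores = relevance(domain, reference)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Terms that are frequent in the domain text but rare in the reference text rank
highest; a clustering step would then operate on the top-ranked terms.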

The OntoLT approach follows a similar procedure, but we aim also at more di- 
rectly connecting ontology engineering with linguistic analysis. Through the use of 
mapping rules between linguistic structure and ontological knowledge, linguistic 
knowledge (context words, morphological and syntactic structure, etc.) remains asso- 
ciated with the constructed ontology and may be used subsequently in its application 
and maintenance, e.g. in knowledge markup, ontology mapping and ontology evolu- 
tion. 



2 OntoLT 

The OntoLT approach (introduced in Buitelaar et al., 2003) is available as a plug-in
for the widely used Protégé ontology development tool 2, which enables the definition
of mapping rules with which concepts (Protégé classes) and attributes (Protégé slots)
can be extracted automatically from linguistically annotated text collections. A
number of mapping rules are included with the plug-in, but alternatively the user can
define additional rules.

The ontology extraction process is implemented as follows. OntoLT provides a 
precondition language, with which the user can define mapping rules. Preconditions 
are implemented as XPATH expressions over the XML-based linguistic annotation. If 
all constraints are satisfied, the mapping rule activates one or more operators that 
describe in which way the ontology should be extended if a candidate is found. 
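
The following minimal sketch illustrates the idea of a rule whose preconditions are
XPath expressions and whose operators fire only when every precondition matches. It
is not OntoLT's actual precondition language: the dictionary layout of `RULE`, the
operator `create_candidate` and the toy annotation are assumptions, and Python's
`xml.etree.ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Toy annotation, loosely modelled on the exchange format (not real tool output).
ANNOTATION = """
<sentence id="s1">
  <clauses>
    <clause id="cl1" pred="p2">
      <arg id="a1" type="SUBJ" phrase="p1"/>
      <arg id="a2" type="DOBJ" phrase="p3"/>
    </clause>
    <clause id="cl2" pred="p5">
      <arg id="a3" type="SUBJ" phrase="p4"/>
    </clause>
  </clauses>
</sentence>
"""

def create_candidate(clause):
    """Hypothetical operator: record which phrases would feed a class and a slot."""
    subj = clause.find("arg[@type='SUBJ']").get("phrase")
    dobj = clause.find("arg[@type='DOBJ']").get("phrase")
    return {"class_from": subj, "slot_from": clause.get("pred"), "range_from": dobj}

# A rule pairs XPath preconditions (evaluated relative to a clause) with operators.
RULE = {
    "preconditions": ["arg[@type='SUBJ']", "arg[@type='DOBJ']"],
    "operators": [create_candidate],
}

def apply_rule(rule, root):
    """Activate the rule's operators on every clause satisfying all preconditions."""
    candidates = []
    for clause in root.iter("clause"):
        if all(clause.find(xp) is not None for xp in rule["preconditions"]):
            candidates.extend(op(clause) for op in rule["operators"])
    return candidates

candidates = apply_rule(RULE, ET.fromstring(ANNOTATION))
print(candidates)
```

Only the first clause yields a candidate here, because the second clause has no
direct object and therefore fails one of the preconditions.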

Predefined preconditions select for instance the predicate of a sentence, its linguis- 
tic subject or direct object. Preconditions can also be used to check certain conditions 
on these linguistic entities, for instance if the subject in a sentence corresponds to a 
particular lemma (the morphological stem of a word). The precondition language 
consists of Terms and Functions, to be discussed in more detail in section 4.2. 

Selected linguistic entities may be used in constructing or extending an ontology. 
For this purpose, OntoLT provides operators to create classes, slots and instances. 
According to which preconditions are satisfied, corresponding operators will be acti- 
vated to create a set of candidate classes and slots that are to be validated by the user. 
Validated candidates are then integrated into a new or existing ontology. 



2 http://protege.stanford.edu
Figure 1: Overview of the OntoLT Approach



3 Linguistic Annotation 

Linguistic annotation is not integrated with OntoLT, but is accessed via an XML- 
based exchange format, which integrates multiple levels of linguistic and semantic 
analysis in a multi-layered DTD with each analysis level (e.g. morphological, syntac- 
tic and dependency structure) organized as a separate track with options of reference 
between them via indices 3 . 

Linguistic annotation is currently provided by SCHUG, a rule-based system for 
German and English analysis (Declerck, 2002) that implements a cascade of increas- 
ingly complex linguistic fragment recognition processes. SCHUG provides annota- 
tion of part-of-speech (through integration of TnT: Brants, 2000), morphological 
inflection and decomposition (based on Mmorph: Petitpierre and Russell, 1995), 
phrase and dependency structure (head-complement, head-modifier and grammatical 
functions). 

In Figure 2, we present a section of the linguistic annotation for the following sen- 
tence (German with corresponding sentence from the English abstract): 



3 The format presented here is based on proposals and implementations described in (Buitelaar 
et al., 2003) and (Buitelaar and Declerck, 2003). 
An 40 Kniegelenkpräparaten wurden mittlere Patellarsehnendrittel mit einer neuen
Knochenverblockungstechnik in einem zweistufigen Bohrkanal bzw. mit konventioneller
Interferenzschraubentechnik femoral fixiert.

(In 40 human cadaver knees, either a mid patellar ligament third with a trapezoid 
bone block on one side was fixed on the femoral side in a 2-diameter drill hole, or a 
conventional interference screw fixation was applied.) 

The linguistic annotation for this sentence consists of part-of-speech and lemmati- 
zation information in the <text> level, phrase structure (including head-modifier 
analysis) in the <phrases> level and grammatical function analysis in the 
<clauses> level (in this sentence there is only one clause, but more than one clause 
per sentence is possible). 

Part-of-speech information consists of the correct syntactic class (e.g. noun, verb)
for a particular word given its current context. For instance, the word works will be
either a verb (working the whole day) or a noun (all his works have been sold).

Morphological information consists of inflectional, derivational or compound
information of a word. In many languages other than English the morphological system
is very rich and enables the construction of semantically complex compound words.
For instance, the German word Kreuzbandverletzung corresponds to three words in
English: ‘cruciate ligament injury’.

Phrase structure information consists of an analysis of the syntactic structure of a 
sentence into constituents that are headed by an adjective, a noun or a preposition. 
Additionally, the internal structure of the phrase will be analyzed and represented, 
which includes information on modifiers that further specify the head. For instance, 
in the nominal phrase neue Technik (new technology) the modifier neu further speci- 
fies the head Technik. 

Clause structure information consists of an analysis of the core semantic units
(clauses) in a sentence with each clause consisting of a predicate (mostly a verb) with
its arguments and adjuncts. Arguments are expressed by grammatical functions such
as the subject or direct object of a verb. Adjuncts are mostly prepositional phrases,
which further specify the clause. For instance, in John played football in the garden
the prepositional phrase in the garden further specifies the clause
“play(John, football)”.

All such information is provided by the annotation format that is illustrated in
Figure 2 below. For instance, the direct object (DOBJ) in the sentence above (or rather
in clause cl1) covers the nominal phrase p2, which in turn corresponds to tokens t5
to t10 (mittlere Patellarsehnendrittel mit einer neuen Knochenverblockungstechnik).
As token t6 is a German compound word, a morphological analysis is included that
corresponds to lemmas t6.l1, t6.l2, t6.l3.
<sentence id="s3" stype="decl" corresp=" ">
  <clauses>
    <clause id="cl1" from="p1" to="p5" pred="p5" type="pass">
      <arg id="a1" type="SUBJ" phrase="none"/>
      <arg id="a2" type="IOBJ" phrase="p1"/>
      <arg id="a3" type="DOBJ" phrase="p2"/>
      <arg id="a4" type="PP_ADJ" phrase="p3"/>
    </clause>
  </clauses>
  <phrases>
    <phrase id="p2" from="t5" to="t10" type="NP">
      <mod from="t5" to="t5"/>
      <head from="t6" to="t6"/>
      <mod_post from="t7" to="t10"/>
    </phrase>
  </phrases>
  <text>
    <token id="t1" pos="APPR" str="An">
      <lemma id="t1.l1">an</lemma>
    </token>
    <token id="t2" pos="CARD" str="40"/>
    <token id="t3" pos="NN" str="Kniegelenkpraeparaten">
      <lemma id="t3.l1">Kniegelenk</lemma>
      <lemma id="t3.l2">Praeparat</lemma>
    </token>
    <token id="t4" pos="VAFIN" str="wurden">
      <lemma id="t4.l1">werden</lemma>
    </token>
    <token id="t5" pos="ADJA" str="mittlere">
      <lemma id="t5.l1">mittler</lemma>
    </token>
    <token id="t6" pos="NN" str="Patellarsehnendrittel">
      <lemma id="t6.l1">patellar</lemma>
      <lemma id="t6.l2">Sehne</lemma>
      <lemma id="t6.l3">Drittel</lemma>
    </token>
    <token id="t19" pos="ADJD" str="femoral"/>
    <token id="t20" pos="VVPP" str="fixiert">
      <lemma id="t20.l1">fixieren</lemma>
    </token>
    <token id="t21" pos="PUNCT" str="."/>
  </text>
</sentence>



Figure 2: Linguistic Annotation Example 
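
The index-based references between annotation levels can be followed
programmatically. The sketch below resolves the direct object of the example clause
to its phrase, its words and the lemmas of its head token. It is only an
illustration: the helper names `token_span` and `resolve_argument` are hypothetical,
the excerpt contains only tokens t5 and t6, and the code assumes that token ids
always have the form "t" followed by an integer.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the annotation example (tokens t7 to t10 are not included).
EXCERPT = """
<sentence id="s3">
  <clauses>
    <clause id="cl1" from="p1" to="p5" pred="p5" type="pass">
      <arg id="a3" type="DOBJ" phrase="p2"/>
    </clause>
  </clauses>
  <phrases>
    <phrase id="p2" from="t5" to="t10" type="NP">
      <mod from="t5" to="t5"/>
      <head from="t6" to="t6"/>
      <mod_post from="t7" to="t10"/>
    </phrase>
  </phrases>
  <text>
    <token id="t5" pos="ADJA" str="mittlere">
      <lemma id="t5.l1">mittler</lemma>
    </token>
    <token id="t6" pos="NN" str="Patellarsehnendrittel">
      <lemma id="t6.l1">patellar</lemma>
      <lemma id="t6.l2">Sehne</lemma>
      <lemma id="t6.l3">Drittel</lemma>
    </token>
  </text>
</sentence>
"""

def token_span(root, start, end):
    """Return the <token> elements whose numeric id lies between start and end."""
    lo, hi = int(start[1:]), int(end[1:])
    return [t for t in root.iter("token") if lo <= int(t.get("id")[1:]) <= hi]

def resolve_argument(root, arg_type):
    """Follow clause argument -> phrase -> tokens, and collect the head's lemmas."""
    arg = root.find(f".//arg[@type='{arg_type}']")
    phrase_id = arg.get("phrase")
    phrase = root.find(f".//phrase[@id='{phrase_id}']")
    head = phrase.find("head")
    tokens = token_span(root, phrase.get("from"), phrase.get("to"))
    head_tokens = token_span(root, head.get("from"), head.get("to"))
    return {
        "phrase": phrase_id,
        "words": [t.get("str") for t in tokens],
        "head_lemmas": [l.text for t in head_tokens for l in t.iter("lemma")],
    }

dobj = resolve_argument(ET.fromstring(EXCERPT), "DOBJ")
print(dobj)
```

Because the head token t6 is a compound, the lookup returns its three component
lemmas rather than a single stem.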
4 Ontology Extraction from Text with OntoLT

The ontology extraction process is implemented as follows. OntoLT provides a pre- 
condition language with which the user can define mapping rules. Preconditions are 
implemented as XPATH expressions over the linguistic annotation. If the precondi- 
tion is satisfied, the mapping rule activates one or more operators that describe in 
which way the ontology should be extended if a candidate is found. 



4.1 Mapping Rules 

A number of mapping rules are predefined and included with the OntoLT plug-in, but
alternatively the user may define additional mapping rules, either manually or by the
integration of a machine learning process. In Figure 3, two rules are defined for
mapping information from the linguistic annotation to potential Protégé classes and slots:

• HeadNounToClass_ModToSubClass maps a head-noun to a class and in
combination with its modifier(s) to one or more sub-class(es)

• SubjToClass_PredToSlot_DObjToRange maps a linguistic subject to a
class, its predicate to a corresponding slot for this class and the direct object to
the “range” of this slot.
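
The effect of the two rules listed above can be sketched on a toy in-memory
structure. This is not the plug-in's implementation and does not use the Protégé
API: the `CandidateOntology` class, the function names, the "modifier_head" naming
scheme for sub-classes and the English example triple are all assumptions made for
illustration.

```python
from dataclasses import dataclass, field

@dataclass
class CandidateOntology:
    """Toy container for candidate classes and slots awaiting user validation."""
    parents: dict = field(default_factory=dict)  # class name -> parent name or None
    slots: dict = field(default_factory=dict)    # class name -> {slot name: range}

    def add_class(self, name, parent=None):
        # Keep the existing entry if the class was already proposed.
        self.parents.setdefault(name, parent)
        self.slots.setdefault(name, {})

    def add_slot(self, cls, slot, range_cls):
        self.add_class(cls)
        self.add_class(range_cls)
        self.slots[cls][slot] = range_cls

def head_noun_to_class_mod_to_subclass(onto, head, modifiers):
    """Head noun becomes a class; each modifier + head becomes a sub-class."""
    onto.add_class(head)
    for mod in modifiers:
        onto.add_class(f"{mod}_{head}", parent=head)

def subj_to_class_pred_to_slot_dobj_to_range(onto, subj, pred, dobj):
    """Subject becomes a class, predicate a slot on it, direct object the range."""
    onto.add_slot(subj, pred, dobj)

onto = CandidateOntology()
# Lemmas of the nominal phrase "neue Technik" discussed in section 3.
head_noun_to_class_mod_to_subclass(onto, "Technik", ["neu"])
# Hypothetical subject / predicate / direct object lemmas.
subj_to_class_pred_to_slot_dobj_to_range(onto, "surgeon", "fix", "ligament")
print(onto.parents)
print(onto.slots["surgeon"])
```

In this sketch the candidates are merely collected; in the approach described here
they would still have to be validated by the user before being integrated into the
ontology.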




Figure 3: Example Mappings in OntoLT 



